Regular Policies in Stochastic Optimal Control and Abstract Dynamic Programming
Author: Bertsekas (M.I.T.)
Notation: Connection with Abstract DP

Mapping of a stationary policy μ: For any control function μ, with μ(x) ∈ U(x) for all x, and J ∈ E(X), define the mapping Tμ : E(X) → E(X) by

(TμJ)(x) = E{ g(x, μ(x), w) + α J(f(x, μ(x), w)) },  x ∈ X

Value iteration (VI) mapping: For any J ∈ E(X), define the mapping T : E(X) → E(X) by

(TJ)(x) = inf_{u ∈ U(x)} E{ g(x, u, w) + α J(f(x, u, w)) },  x ∈ X

Note that Bellman's equation is J = TJ, and VI starting from J is T^k J, k = 0, 1, ...

Abstract notation relating to regularity: We have

(Tμ_0 ··· Tμ_{N−1} J)(x_0) = E{ α^N J(x_N) + Σ_{k=0}^{N−1} α^k g(x_k, μ_k(x_k), w_k) }

and C is S-regular if

Jπ(x) = lim sup_{N→∞} (Tμ_0 ··· Tμ_{N−1} J)(x),  ∀ (π, x) ∈ C, J ∈ S
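The following minimal sketch (my own illustration, not from the talk) implements Tμ and T for a small finite-state stochastic model in which the disturbance w is folded into transition probabilities; the arrays P, G and the value of alpha are made-up data, and U(x) is finite so the infimum becomes a minimum.

```python
import numpy as np

# Hypothetical model: states X = {0, 1}, controls U(x) = {0, 1}.
# P[u, x, y] = transition probability to y under control u at x;
# G[x, u]    = expected stage cost E{g(x, u, w)};  alpha = discount factor.
P = np.array([[[1.0, 0.0], [0.3, 0.7]],     # control u = 0
              [[0.5, 0.5], [0.0, 1.0]]])    # control u = 1
G = np.array([[0.0, 1.0],
              [2.0, 0.5]])
alpha = 0.9

def T_mu(J, mu):
    """(Tμ J)(x) = E{ g(x, μ(x), w) + α J(f(x, μ(x), w)) }."""
    return np.array([G[x, mu[x]] + alpha * P[mu[x], x] @ J for x in range(len(J))])

def T(J):
    """(T J)(x) = min over u in U(x) of E{ g(x, u, w) + α J(f(x, u, w)) }."""
    return np.array([min(G[x, u] + alpha * P[u, x] @ J for u in range(P.shape[0]))
                     for x in range(len(J))])

# VI computes T^k J, k = 0, 1, ...; its limit solves Bellman's equation J = TJ.
J = np.zeros(2)
for _ in range(200):
    J = T(J)
print(J, T_mu(J, [0, 1]))   # approximate fixed point of T, and one application of Tμ
```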
Upper Bounding the Fixed Points of T

[Figure: fixed points J′ and J*_C of T, a function Ĵ ∈ S, the limit region, and the valid start region for VI; the accompanying examples have stationary policy costs Jμ(1) = b, Jμ′(1) = 0 and optimal cost J*(1) = min{b, 0}.]

Let C be an S-regular collection. Then, for all fixed points J′ of T and all J ∈ E(X) such that J′ ≤ J ≤ Ĵ for some Ĵ ∈ S,

J′ ≤ lim inf_{k→∞} T^k J ≤ lim sup_{k→∞} T^k J ≤ J*_C

If in addition J*_C is a fixed point of T (a common case), then J*_C is the largest fixed point of T (see the small sketch below).
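To make the largest-fixed-point statement concrete, here is a small sketch of my own based on the one-state example suggested by the figure labels: at state 1, one control moves to the cost-free terminal state at cost b, another self-transitions at cost 0, with no discounting. Every J(1) ≤ b is a fixed point of T, the optimal cost is J*(1) = min{b, 0}, and the optimal cost over the proper (terminating) policies, J*_C(1) = b, is the largest fixed point.

```python
b = 1.0   # cost of the terminating control (made-up value)

def T(J1):
    # Bellman operator at state 1: stop at cost b, or self-transition at cost 0.
    return min(b, 0.0 + J1)

# Fixed points of T are exactly the values J1 <= b; with b > 0, J*(1) = 0 (never stop),
# while the largest fixed point b is the optimal cost over terminating policies.
for J0 in [5.0, 1.0, 0.5, -2.0]:
    J = J0
    for _ in range(50):
        J = T(J)
    print(f"VI from J0 = {J0:5.1f} ends at {J:5.1f}")
# Started at or above b, VI reaches b = J*_C; started below b, it stalls at J0.
```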
Characterizing VI Convergence

[Figure: the limit region and valid start region for VI (same illustration as for the preceding result).]

VI-related properties:
- If J*_C is a fixed point of T, then VI converges to J*_C starting from any J ∈ E(X) such that J*_C ≤ J ≤ Ĵ for some Ĵ ∈ S.
- J* does not enter the picture! It is possible that VI converges to J*_C and not to J* (which may not even be a fixed point of T).
- When J* is a fixed point of T, a useful analytical strategy is to choose C such that J*_C = J*. Then a VI convergence result is obtained.

Nonnegative Cost Optimal Control

Cost nonnegativity, g ≥ 0, provides a favorable structure (Strauch 1966):
- J* is the smallest fixed point of T within E(X).
- VI converges to J* starting from 0 under some mild compactness conditions (a small numerical illustration follows below).

Regularity-based analytical approach:
- Define a collection C such that J*_C = J*.
- Define a set S ⊂ E(X) such that C is S-regular.
- Use the main result in conjunction with the fixed point property of J* to show that J* is the unique fixed point of T within S.
- Use the main result to show that the VI algorithm converges to J* starting from J within the set {J ∈ S | J ≥ J*}.
- Enlarge the set of functions starting from which VI converges to J*, using a compactness condition.

We use this approach in three major applications.
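As a small numerical illustration of the Strauch-type result (my own example, not from the talk): with a single nonterminal state 1 whose only control costs 1 and moves to the cost-free absorbing state 0 with probability 0.5 (otherwise staying at 1), Bellman's equation gives J*(1) = 2, and VI started from the zero function increases monotonically to J*.

```python
# Nonnegative-cost example (made up): J(1) = 1 + 0.5*J(1)  =>  J*(1) = 2.
def T(J1):
    return 1.0 + 0.5 * J1   # stage cost 1, stay at state 1 with probability 0.5

J = 0.0                     # VI started from the zero function
for k in range(30):
    J = T(J)                # iterates increase monotonically toward J*(1) = 2
print(round(J, 6))          # ~2.0
```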
Application to Nonnegative Cost Deterministic Optimal Control

Classic problem of regulation to a terminal set:
- System: x_{k+1} = f(x_k, u_k); cost per stage: g(x_k, u_k) ≥ 0.
- Cost-free and absorbing terminal set of states X_s that we aim to reach, or approach asymptotically, at minimum cost.

Assumptions:
- J*(x) > 0 for all x ∉ X_s.
- Controllability: for all x with J*(x) < ∞ and all ε > 0, there exists a policy π that reaches X_s (in a finite number of steps) starting from x, with cost Jπ(x) ≤ J*(x) + ε.

Define
C = {(π, x) | J*(x) < ∞, π reaches X_s starting from x},
S = {J ∈ E(X) | J(x) = 0, ∀ x ∈ X_s}.

Results:
- J* is the unique solution of Bellman's equation within S.
- VI converges to J* starting from any J_0 ∈ S with J_0 ≥ J* (and for any J_0 ∈ S under a compactness condition).

Application to Nonnegative Cost Stochastic Optimal Control

Problem:
- System: x_{k+1} = f(x_k, u_k, w_k); cost per stage: g(x_k, u_k, w_k) ≥ 0.

Define
C = {(π, x) | Jπ(x) < ∞}, so that J*_C = J*,
S = {J ∈ E(X) | E_{x_0}{J(x_k)} → 0, ∀ (π, x_0) ∈ C},
where the expectation is over the state sequence generated by π starting from x_0.

Results:
- J* is the unique solution of Bellman's equation within S.
- VI converges to J* starting from any J_0 ∈ S with J_0 ≥ J* (and for any J_0 ∈ S under a compactness condition).

An interesting consequence (Yu and Bertsekas, 2013): if a function J ∈ E(X) satisfies J* ≤ J ≤ cJ* for some c ≥ 1, then VI converges to J* starting from J.

Application to Discounted Nonnegative Cost Stochastic Optimal Control

The problem above with discount factor α < 1.

Terminology and definitions:
- X_f = {x ∈ X | J*(x) < ∞}.
- π is stable from x_0 ∈ X_f if there is a bounded subset of X_f such that the sequence {x_k} generated starting from x_0 and using π lies within that subset with probability 1.
- C = {(π, x) | x ∈ X_f, π is stable from x}.
- J ∈ E(X) is bounded on bounded subsets of X_f if for every bounded subset X̃ ⊂ X_f there is a scalar b such that J(x) ≤ b for all x ∈ X̃.
- S = {J ∈ E(X) | J is bounded on bounded subsets of X_f}.

Assumption: C is nonempty, J* ∈ S, and for every x ∈ X_f and ε > 0, there exists a policy π that is stable from x and satisfies Jπ(x) ≤ J*(x) + ε.

Results:
- J* is the unique solution of Bellman's equation within S.
- VI converges to J* starting from any J_0 ∈ S with J_0 ≥ J* (and for any J_0 ∈ S under a compactness condition).

S-Regular Collections Involving Stationary Policies

Definitions (for a nonempty set of functions S ⊂ E(X)):
- We say that a stationary policy μ is S-regular if (Tμ)^k J → Jμ for all J ∈ S.
- Equivalently, μ is S-regular if the collection C = {(μ, x) | x ∈ X} is S-regular.
- Let M_S be the set of policies that are S-regular, and define J*_S(x) = inf_{μ ∈ M_S} Jμ(x), ∀ x ∈ X. Equivalently, J*_S = J*_C when C = M_S × X.

VI convergence result: Given a set S ⊂ E(X), assume that
- there exists at least one S-regular policy, and
- J*_S is a fixed point of T.
Then T^k J → J*_S for every J ∈ E(X) such that J*_S ≤ J ≤ Ĵ for some Ĵ ∈ S.
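A quick way to see what S-regularity of a stationary policy means in a shortest-path setting is the following made-up two-policy check (with S taken as the real-valued functions on the single nonterminal state): the iterates (Tμ)^k J of a proper policy converge to Jμ from every starting J, while those of an improper zero-cost policy do not.

```python
# Hypothetical SSP with states {0 (terminal, cost-free), 1}; J is identified with J(1).
# Proper policy mu:   stage cost 1, go to 0 w.p. 0.5, stay at 1 w.p. 0.5  ->  J_mu = 2.
# Improper policy nu: self-loop at state 1 with stage cost 0              ->  J_nu = 0.
T_mu = lambda J: 1.0 + 0.5 * J
T_nu = lambda J: 0.0 + J

def iterate(T, J, k=100):
    for _ in range(k):
        J = T(J)
    return J

for J0 in [-3.0, 0.0, 7.0]:
    print(J0, iterate(T_mu, J0), iterate(T_nu, J0))
# (T_mu)^k J -> 2 = J_mu for every real J0, so mu is S-regular.
# (T_nu)^k J stays at J0 and reaches J_nu = 0 only if J0 = 0, so nu is not S-regular.
```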
Policy Iteration

Definitions:
- Standard PI: the new policy μ^{k+1} satisfies Tμ^{k+1} Jμ^k = T Jμ^k.
- Optimistic PI: μ^k satisfies Tμ^k J_k = T J_k, and J_{k+1} = (Tμ^k)^{m_k} J_k (evaluation of the current policy is approximate, using m_k iterations of VI).

Convergence of standard PI, assuming J* ≥ 0:
- The sequence {μ^k} satisfies Jμ^k ↓ J_∞, where J_∞ is a fixed point of T with J_∞ ≥ J*.
- If, for a set S ⊂ E(X), the policies μ^k generated are S-regular and Jμ^k ∈ S for all k, then Jμ^k ↓ J*_S and J*_S is a fixed point of T.

Convergence of optimistic PI:
- The sequence {J_k} satisfies J_k ↓ J_∞, where J_∞ is a fixed point of T.
- If, for a set S ⊂ E(X), the policies μ^k generated are S-regular and Jμ^k ∈ S for all k, then J_k ↓ J*_S and J*_S is a fixed point of T.

With more analysis and conditions, we can show that J_∞ = J*. This is true for the deterministic and stochastic nonnegative cost problems. (A small illustrative PI sketch for a finite shortest-path problem appears after the stochastic shortest path material below.)

Stochastic Shortest Path Problems

Problem formulation:
- Finite state space X = {0, 1, ..., n}, with 0 being a cost-free and absorbing state.
- Transition probabilities p_{xy}(u).
- U(x) is finite for all x ∈ X.
- No discounting (α = 1).

Proper policies:
- μ is proper if the terminal state 0 is reached with probability 1 under μ (and is improper otherwise).
- Let S = ℜ^n, the real-valued functions on the nonterminal states. Then μ is S-regular if and only if it is proper. (The idea of an S-regular policy evolved as a generalization of a proper policy.)

Contraction properties:
- The mapping Tμ of a policy μ is a weighted sup-norm contraction if and only if μ is proper.
- If all stationary policies are proper, then T is a weighted sup-norm contraction, and the problem behaves like a discounted problem.
- SSP is a prime example of a semicontractive model (some policies correspond to contractions while others do not).

Stochastic Shortest Path Problems: Results

Case where improper policies have infinite cost: if there exists a proper policy and, for every improper μ, Jμ(x) = ∞ for some x, then J* is the unique fixed point of T within ℜ^n.

Case where improper policies may have finite cost:
- A perturbation-based PI algorithm (adding δ_k > 0 to g, with δ_k ↓ 0) converges to an optimal policy within the class of proper policies, if started with a proper policy.
- An improper policy may be (overall) optimal, while J* need not be a fixed point of T.
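As a complement to the PI discussion above, here is a minimal sketch (my own illustration, with made-up transition data) of standard policy iteration for a small SSP in which all stationary policies are proper; each iteration solves the linear policy-evaluation equation J = TμJ and then performs a policy improvement step, terminating with an optimal proper policy.

```python
import numpy as np

# Hypothetical SSP: terminal state 0 (cost-free, absorbing), nonterminal states 1, 2.
# P[x][u] maps each successor state to its probability; g[x][u] is the stage cost.
P = {1: {0: {0: 1.0}, 1: {2: 1.0}},
     2: {0: {0: 1.0}, 1: {0: 0.8, 1: 0.2}}}
g = {1: {0: 1.0, 1: 0.5},
     2: {0: 2.0, 1: 1.0}}
states, controls = [1, 2], [0, 1]

def evaluate(mu):
    """Policy evaluation: solve J = Tmu J, i.e. (I - Pmu) J = gmu, on the nonterminal states."""
    A, b = np.eye(len(states)), np.zeros(len(states))
    for i, x in enumerate(states):
        b[i] = g[x][mu[x]]
        for j, y in enumerate(states):
            A[i, j] -= P[x][mu[x]].get(y, 0.0)
    return np.linalg.solve(A, b)

def improve(J):
    """Policy improvement: pick mu'(x) attaining the minimum in (T J)(x); J(0) = 0."""
    Jmap = dict(zip(states, J))
    return {x: min(controls, key=lambda u: g[x][u] +
                   sum(p * Jmap.get(y, 0.0) for y, p in P[x][u].items()))
            for x in states}

mu = {1: 0, 2: 0}                 # start from some proper policy
while True:
    J = evaluate(mu)
    new_mu = improve(J)
    if new_mu == mu:
        break
    mu = new_mu
print(mu, J)                      # optimal policy and J* = [1.0, 1.2]
```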
Similar Articles
Stochastic Dynamic Programming with Markov Chains for Optimal Sustainable Control of the Forest Sector with Continuous Cover Forestry
We present a stochastic dynamic programming approach with Markov chains for optimal control of the forest sector. The forest is managed via continuous cover forestry and the complete system is sustainable. Forest industry production, logistic solutions and harvest levels are optimized based on the sequentially revealed states of the markets. Adaptive full system optimization is necessary for co...
An Application of the Stochastic Optimal Control Algorithm (OPTCON) to the Public Sector Economy of Iran
In this paper we first describe the stochastic optimal control algorithm called ((OPTCON)). The algorithm minimizes an intertemporal objective loss function subject to a nonlinear dynamic system in order to achieve optimal value of control (or instrument) variables. Second as an application, we implemented the algorithm by the statistical programming system ((GAUSS)) to determine the optimal fi...
Modelling and Decision-making on Deteriorating Production Systems using Stochastic Dynamic Programming Approach
This study aimed at presenting a method for formulating optimal production, repair and replacement policies. The system was based on the production rate of defective parts and machine repairs and then was set up to optimize maintenance activities and related costs. The machine is either repaired or replaced. The machine is changed completely in the replacement process, but the productio...
A Multi-Stage Single-Machine Replacement Strategy Using Stochastic Dynamic Programming
In this paper, the single machine replacement problem is being modeled into the frameworks of stochastic dynamic programming and control threshold policy, where some properties of the optimal values of the control thresholds are derived. Using these properties and by minimizing a cost function, the optimal values of two control thresholds for the time between productions of two successive nonco...
Regular Policies in Abstract Dynamic Programming
We consider challenging dynamic programming models where the associated Bellman equation, and the value and policy iteration algorithms commonly exhibit complex and even pathological behavior. Our analysis is based on the new notion of regular policies. These are policies that are well-behaved with respect to value and policy iteration, and are patterned after proper policies, which are central...
Optimization Model of Hirmand River Basin Water Resources in the Agricultural Sector Using Stochastic Dynamic Programming under Uncertainty Conditions
In this study, water management allocated to the agricultural sector’ was analyzed using stochastic dynamic programming under uncertainty conditions. The technical coefficients used in the study referred to the agricultural years, 2013-2014. They were obtained through the use of simple random sampling of 250 farmers in the region for crops wheat, barley, melon, watermelon and ruby grapes under ...